
Refactor grid default boxes with torch meshgrid #3799

Merged · 8 commits merged into pytorch:master from zhiqwang:refactor-grid-boxes on May 11, 2021

Conversation

@zhiqwang (Contributor) commented May 8, 2021

As mentioned in #3403 (comment), I refactored the following for-loop default box generation using torch.meshgrid.

for k, f_k in enumerate(grid_sizes):
    # Now add the default boxes for each width-height pair
    for j in range(f_k[0]):
        if self.steps is not None:
            y_f_k = image_size[1] / self.steps[k]
        else:
            y_f_k = float(f_k[0])
        cy = (j + 0.5) / y_f_k
        for i in range(f_k[1]):
            if self.steps is not None:
                x_f_k = image_size[0] / self.steps[k]
            else:
                x_f_k = float(f_k[1])
            cx = (i + 0.5) / x_f_k
            default_boxes.extend([[cx, cy, w, h] for w, h in self._wh_pairs[k]])
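
For reference, here is a minimal, self-contained sketch of the meshgrid-based version for a single feature-map level. The f_k and wh_pairs values are placeholders, and the code illustrates the approach rather than reproducing the exact merged implementation:

import torch

# Sketch: vectorized default-box generation for one feature-map level.
f_k = (38, 38)                                          # placeholder grid (H, W)
y_f_k, x_f_k = float(f_k[0]), float(f_k[1])
wh_pairs = torch.tensor([[0.07, 0.07], [0.10, 0.05]])   # placeholder (w, h) pairs

# Box centers for every grid cell, normalized to [0, 1].
shifts_x = (torch.arange(0, f_k[1], dtype=torch.float32) + 0.5) / x_f_k
shifts_y = (torch.arange(0, f_k[0], dtype=torch.float32) + 0.5) / y_f_k
shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
shift_x = shift_x.reshape(-1)
shift_y = shift_y.reshape(-1)

# One (cx, cy) per cell, repeated once per (w, h) pair, then concatenated with
# the pairs themselves to form (cx, cy, w, h) rows.
shifts = torch.stack((shift_x, shift_y) * len(wh_pairs), dim=-1).reshape(-1, 2)
wh = wh_pairs.repeat(f_k[0] * f_k[1], 1)
default_boxes = torch.cat((shifts, wh), dim=1)          # shape: (38 * 38 * 2, 4)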

I benchmarked the runtime of the torch.meshgrid and for-loop methods locally using torch.utils.benchmark.Timer (timing only the default_boxes generation part).

On CUDA device:

for-loop (Currently used):
  2.11 ms
  1 measurement, 100 runs, 1 thread

torch.meshgrid:
  1.61 ms
  1 measurement, 100 runs, 1 thread

On CPU device:

for-loop (Currently used):
  1.90 ms
  1 measurement, 100 runs, 1 thread

torch.meshgrid:
  1.17 ms
  1 measurement, 100 runs, 1 thread

EDIT: The parameters I used are below:

device = torch.device('cuda')

# Define the DefaultBoxGenerator for VGG16
aspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]]
scales = [0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05]
steps = [8, 16, 32, 64, 100, 300]
dbox_generator = DefaultBoxGenerator(aspect_ratios, scales=scales, steps=steps)
dbox_generator = dbox_generator.to(device)
dbox_generator.eval()

# Forward of DefaultBoxGenerator
images = torch.zeros(2, 3, 300, 300).to(device)
feature_shapes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]
features = [torch.zeros(2, 8, fx, fy).to(device) for (fx, fy) in feature_shapes]
image_shapes = [i.shape[-2:] for i in images]
images = ImageList(images, image_shapes)

grid_sizes = [feature.shape[-2:] for feature in features]
image_size = images.tensors.shape[-2:]
dtype = features[0].dtype

default_boxes = dbox_generator.grid_default_boxes(grid_sizes, image_size, dtype=dtype)
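
The exact Timer invocation is not shown in the PR; a plausible sketch, assuming the statement simply calls the generation method with the variables defined above, is:

from torch.utils.benchmark import Timer

# Assumed harness: time only the default_boxes generation, 100 runs, 1 thread.
timer = Timer(
    stmt="dbox_generator.grid_default_boxes(grid_sizes, image_size, dtype=dtype)",
    globals={
        "dbox_generator": dbox_generator,
        "grid_sizes": grid_sizes,
        "image_size": image_size,
        "dtype": dtype,
    },
)
print(timer.timeit(100))  # prints the mean time over 100 runs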

cc @fmassa @datumbox

y_f_k, x_f_k = f_k

shifts_x = (torch.arange(0, f_k[1], dtype=dtype) + 0.5) / x_f_k
shifts_y = (torch.arange(0, f_k[0], dtype=dtype) + 0.5) / y_f_k
@zhiqwang (Contributor, Author) commented May 9, 2021

Actually, the default_boxes are generated on the CPU and then moved to the CUDA device. I've tried the following to generate the default_boxes directly on the CUDA device, but it takes longer than the for-loop method.

shifts_x = (torch.arange(0, f_k[1], device=torch.device('cuda'), dtype=dtype) + 0.5) / x_f_k
shifts_y = (torch.arange(0, f_k[0], device=torch.device('cuda'), dtype=dtype) + 0.5) / y_f_k

@fmassa (Member)

I've hit cases in the past where micro-benchmarks on exactly this part of the code could be faster when run on the CPU, but would show significant slowdowns when training on multiple GPUs. So even if this is slower in micro-benchmarks on a single GPU, it might still be faster on multiple GPUs.

@datumbox (Contributor)

@fmassa So you recommend passing the target device to this method and placing the boxes on it right away, right?

@fmassa (Member)

We can leave it as is for now, but I would create a follow-up issue to benchmark this and the other configuration on multiple GPUs to verify.
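
For illustration, a hypothetical helper along the lines of that suggestion (the name make_shifts and its signature are assumptions, not code from this PR):

import torch

# Hypothetical: create the ranges directly on the target device, avoiding a
# CPU-to-GPU copy of the generated boxes afterwards.
def make_shifts(f_k, x_f_k, y_f_k, dtype=torch.float32, device=None):
    shifts_x = (torch.arange(0, f_k[1], dtype=dtype, device=device) + 0.5) / x_f_k
    shifts_y = (torch.arange(0, f_k[0], dtype=dtype, device=device) + 0.5) / y_f_k
    return torch.meshgrid(shifts_y, shifts_x)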

@zhiqwang (Contributor, Author) commented May 11, 2021

FYI, I've tested the inference time over the full COCO eval dataset for both default box generation methods on different devices; the total times are very similar.

Validated with (using a single GPU):

CUDA_VISIBLE_DEVICES=0 python train.py --dataset coco --model ssd300_vgg16 \
    --batch-size 16 --pretrained --test-only

@datumbox self-requested a review May 9, 2021 15:14
@fmassa (Member) left a comment

The overall approach looks great to me, thanks @zhiqwang !

I'll let @datumbox double-check the rest of the logic in the function, but I've made a few minor comments that could potentially help with the speed as well.

@datumbox (Contributor) left a comment

@zhiqwang Thanks a lot for the PR!

I confirmed that using your changes, the statistics of the model remain the same. I left a couple of comments for your consideration but overall it looks good to me. Let me know what you think.

@datumbox (Contributor) left a comment

LGTM! Thanks for the improvement @zhiqwang!

@datumbox merged commit 48441cc into pytorch:master May 11, 2021
@fmassa (Member) commented May 11, 2021

Thanks a ton @zhiqwang !

@zhiqwang (Contributor, Author)

Hi @fmassa and @datumbox , thanks also for your guidance here!

@zhiqwang deleted the refactor-grid-boxes branch May 11, 2021 09:25
@datumbox (Contributor) commented May 11, 2021

@zhiqwang I'm afraid this PR might have broken a code path that is not currently exercised on master but is needed for follow-up work. Could you please have a look?
https://app.circleci.com/pipelines/github/pytorch/vision/7996/workflows/f8ff236b-a8d8-4dca-911c-ab8bc3226aa2/jobs/569327

Edit: To avoid back-and-forth, I tried the following fix directly on the other PR. Please have a look and let me know what you think: ea1e2c4

model_name = 'ssd512_resnet50', dev = device(type='cuda')

    @pytest.mark.parametrize('model_name', get_available_detection_models())
    @pytest.mark.parametrize('dev', _devs)
    def test_detection_model(model_name, dev):
>       ModelTester()._test_detection_model(model_name, dev)

test/test_models.py:452: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test/test_models.py:223: in _test_detection_model
    out = model(model_input)
env/lib/python3.8/site-packages/torch/nn/modules/module.py:1015: in _call_impl
    return forward_call(*input, **kwargs)
torchvision/models/detection/ssd.py:335: in forward
    anchors = self.anchor_generator(images, features)
env/lib/python3.8/site-packages/torch/nn/modules/module.py:1015: in _call_impl
    return forward_call(*input, **kwargs)
torchvision/models/detection/anchor_utils.py:239: in forward
    default_boxes = self._grid_default_boxes(grid_sizes, image_size, dtype=dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = DefaultBoxGenerator(aspect_ratios=[[2], [2, 3], [2, 3], [2, 3], [2], [2]], clip=True, scales=[0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05], steps=None)
grid_sizes = [torch.Size([32, 32]), torch.Size([16, 16]), torch.Size([8, 8]), torch.Size([4, 4]), torch.Size([2, 2]), torch.Size([1, 1])]
image_size = torch.Size([512, 512]), dtype = torch.float16

    def _grid_default_boxes(self, grid_sizes: List[List[int]], image_size: List[int],
                            dtype: torch.dtype = torch.float32) -> Tensor:
        default_boxes = []
        for k, f_k in enumerate(grid_sizes):
            # Now add the default boxes for each width-height pair
            if self.steps is not None:
                x_f_k, y_f_k = [img_shape / self.steps[k] for img_shape in image_size]
            else:
                y_f_k, x_f_k = f_k
    
>           shifts_x = (torch.arange(0, f_k[1], dtype=dtype) + 0.5) / x_f_k
E           RuntimeError: "arange_cpu" not implemented for 'Half'

torchvision/models/detection/anchor_utils.py:209: RuntimeError
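
A plausible workaround for the failure above (an assumption on my part, not necessarily what ea1e2c4 does) is to create the ranges in the default float dtype, for which arange does have a CPU kernel, and cast to the requested dtype afterwards:

import torch

# torch.arange has no CPU kernel for float16, so build the ranges in the
# default dtype first and cast the result.
dtype = torch.float16
f_k = (32, 32)
y_f_k, x_f_k = float(f_k[0]), float(f_k[1])

shifts_x = ((torch.arange(0, f_k[1]) + 0.5) / x_f_k).to(dtype=dtype)
shifts_y = ((torch.arange(0, f_k[0]) + 0.5) / y_f_k).to(dtype=dtype)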

facebook-github-bot pushed a commit that referenced this pull request May 17, 2021
Summary:
* Refactor grid default boxes with torch.meshgrid

* Fix torch jit tracing

* Only doing the list multiplication once

* Make grid_default_box private as suggested

* Replace list multiplication with torch.repeat

* Move the clipping into _grid_default_boxes to accelerate

Reviewed By: datumbox

Differential Revision: D28473335

fbshipit-source-id: 0e6705ecb9a80fc78213ac1e6da1b02eb3381652

Co-authored-by: Francisco Massa <[email protected]>
Co-authored-by: Vasilis Vryniotis <[email protected]>